Exploring Red and White Wine Quality

by Adan Olivera


Abstract

This is an exploratory data analysis of quality of red and white wines by their physicochemical properties such as alcohol level, density and pH. The goal is to find which physicochemical properties correlate to wine quality. We first explore the distribution of individual variables, then their relationships and finally summarise the findings.

This dataset lists 1599 and 4898 instances of red and white wine respectively, each containing 11 variables on the chemical properties of the wine and a quality score based on sensory data - median of at least 3 evaluations made by wine experts.

Of the 11 variables in the dataset, only alcohol showed a moderate correlation with quality in both red and white wines. For red wines, volatile acidity (acetic acid concentration) also showed a weak negative correlation with quality.


Introduction

The wine industry is lucrative and growing as wine becomes more popular and information about it becomes more widely accessible. Many factors may affect the perceived taste and quality of wine. Among them, physicochemical properties of the wine, such as alcohol and sugar levels, pH and chlorides, may play an important role.

Here, we’ll analyse a dataset related to red and white variants of the Portuguese “Vinho Verde” wine, exploring which (if any) chemical properties influence their quality.

This dataset lists 1599 and 4898 instances of red and white wine respectively, each containing 11 variables on the chemical properties of the wine and a quality score based on sensory data: the median of at least 3 evaluations made by wine experts, each of whom rated the wine between 0 (very bad) and 10 (very excellent). Below, we describe each of the attributes recorded for each instance:

1 - fixed acidity (tartaric acid - g / dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity (acetic acid - g / dm^3): the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

3 - citric acid (g / dm^3): found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar (g / dm^3): the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

5 - chlorides (sodium chloride - g / dm^3): the amount of salt in the wine

6 - free sulfur dioxide (mg / dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide (mg / dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density (g / cm^3): the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates (potassium sulphate - g / dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol (% by volume): the percent alcohol content of the wine

12 - quality (score between 0 and 10)

It’s important to note: several of the attributes may be correlated (e.g. density and alcohol level).


The Dataset

Preparation

To start our analysis, let’s first load all necessary packages and our dataset. I’ll create a dataset called “wines” from two separate datasets, one for red wines and one for white wines. I’ll also set quality as an ordered factor variable.
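A minimal sketch of this preparation step follows. The file names are not given in this report, so two tiny in-memory data frames stand in for the red and white files. The red rows reuse values shown in the feature listing of this report, while the white rows are placeholders for illustration only.

```r
# Stand-ins for the two source files (in practice these would come from
# read.csv(); the file names are not stated in this report).
red <- data.frame(
  alcohol = c(9.4, 9.8, 9.8),
  pH      = c(3.51, 3.20, 3.26),
  quality = c(5, 5, 5)
)
white <- data.frame(          # placeholder values, for illustration only
  alcohol = c(8.8, 9.5, 10.1),
  pH      = c(3.00, 3.30, 3.26),
  quality = c(6, 6, 6)
)

# Tag each row with its wine type, then stack the two data frames.
red$type   <- "red"
white$type <- "white"
wines <- rbind(red, white)

# Store quality as an ordered factor so that 3 < 4 < ... < 9.
wines$quality <- factor(wines$quality, levels = 3:9, ordered = TRUE)

str(wines)
```

Fixing the levels to 3:9 keeps all seven ratings in the factor even when a subset of the data does not contain every rating.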

Summary of the Dataset

I already covered all the features of this dataset in the Introduction, but here we have more details about its structure: a list of its columns, the data type of each one and some sample values. We have a total of 6497 observations with 14 variables (of which only 11 are physicochemical).

List of features

## 'data.frame':    6497 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
##  $ type                : chr  "red" "red" "red" "red" ...

Summary of the values in each feature

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.: 813   1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500  
##  Median :1650   Median : 7.000   Median :0.2900   Median :0.3100  
##  Mean   :2044   Mean   : 7.215   Mean   :0.3397   Mean   :0.3186  
##  3rd Qu.:3274   3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :15.900   Max.   :1.5800   Max.   :1.6600  
##                                                                   
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  1.00     
##  1st Qu.: 1.800   1st Qu.:0.03800   1st Qu.: 17.00     
##  Median : 3.000   Median :0.04700   Median : 29.00     
##  Mean   : 5.443   Mean   :0.05603   Mean   : 30.53     
##  3rd Qu.: 8.100   3rd Qu.:0.06500   3rd Qu.: 41.00     
##  Max.   :65.800   Max.   :0.61100   Max.   :289.00     
##                                                        
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.: 77.0        1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300  
##  Median :118.0        Median :0.9949   Median :3.210   Median :0.5100  
##  Mean   :115.7        Mean   :0.9947   Mean   :3.219   Mean   :0.5313  
##  3rd Qu.:156.0        3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000  
##  Max.   :440.0        Max.   :1.0390   Max.   :4.010   Max.   :2.0000  
##                                                                        
##     alcohol      quality      type          
##  Min.   : 8.00   3:  30   Length:6497       
##  1st Qu.: 9.50   4: 216   Class :character  
##  Median :10.30   5:2138   Mode  :character  
##  Mean   :10.49   6:2836                     
##  3rd Qu.:11.30   7:1079                     
##  Max.   :14.90   8: 193                     
##                  9:   5

The median fixed acidity is 7 g/L and the median volatile acidity is 0.29 g/L. Median citric acid concentration is 0.31 g/L and the median pH is 3.21, with all wines being acidic (pH between 2.72 and 4.01). Most wines in this sample aren’t considered sweet, with 75% of them having a residual sugar concentration of 8.1 g/L or less. Median chloride (salt) concentration for these wines is 0.047 g/L. Sulphates concentration is 0.53 g/L on average, and the medians of total and free sulfur dioxide are 118 ppm and 29 ppm, respectively. Density varies very little, ranging from 0.987 to 1.039 g/mL with a mean of 0.995 g/mL. The mean alcohol content is 10.49% by volume, and most wines have a quality rating of 5 or 6, with few of them rated below 5.


Analysis

Exploring Individual Variables

We’ll begin by exploring the distribution and patterns of individual variables and how their distribution varies across types of wine. First, we’ll generate plots and then discuss the patterns and interesting finds.

For some variables, such as chlorides, density and total sulfur dioxide, I had to exclude outliers from the plots, since they created distortions that made it harder to spot overall traits.

As quality is our main feature of interest, we’ll start with it and then explore all other features in order of their expected relevance to differences in quality rating.
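One way to exclude outliers before plotting is to drop values above a chosen upper quantile. The sketch below is only an illustration: the helper name `trim_upper` and the cutoff are assumptions, not the rule actually used for the plots in this report, and the vector is a toy example built from a few chloride values plus one extreme value.

```r
# Hypothetical helper: keep values at or below a chosen upper quantile.
# The cutoff is an assumption, not the rule used in this report.
trim_upper <- function(x, prob = 0.99) {
  x[x <= quantile(x, probs = prob, na.rm = TRUE)]
}

# Toy vector with one extreme value, for illustration only.
chlorides <- c(0.076, 0.098, 0.092, 0.075, 0.076,
               0.075, 0.069, 0.065, 0.073, 0.611)
trimmed <- trim_upper(chlorides, prob = 0.9)

length(chlorides)  # number of values before trimming
length(trimmed)    # the extreme value is dropped
```

Trimming only affects what is drawn; summaries such as the mean and median should still be computed on the full data.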

Quality of Wine

Summary of Quality in Red Wine

##   3   4   5   6   7   8   9 
##  10  53 681 638 199  18   0

Summary of Quality in White Wine

##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Most red wines were rated 5 with a minimum rating of 3 and maximum of 8. White wines had generally better ratings, with most of them being rated 6 and 5 of them being rated 9, the highest rating among this group. 67% of white wines have ratings greater than 5 in comparison with only 53% of red wines.
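The two percentages can be checked directly from the quality counts in the tables above. This is a small worked example; the function name `share_above_5` is only illustrative.

```r
# Quality counts copied from the two tables above (ratings 3 to 9).
ratings      <- 3:9
red_counts   <- c(10, 53, 681, 638, 199, 18, 0)
white_counts <- c(20, 163, 1457, 2198, 880, 175, 5)

# Share of wines rated above 5, as a percentage.
share_above_5 <- function(counts) {
  100 * sum(counts[ratings > 5]) / sum(counts)
}

round(share_above_5(red_counts))    # 53
round(share_above_5(white_counts))  # 67
```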

Quality rating distribution appears to be normal for both red and white wines, though the distribution for red wines seems to be wider. Here we can clearly see that most wines in the sample are white.

Alcohol Level

Summary of Alcohol in Red Wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Summary of Alcohol in White Wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Alcohol concentration distribution is very similar for red and white wines. Both range from about 8% to 14% of volume and appear to be bimodal, with a peak below 10% (about 9.5%) and a smaller one above 10% (about 11% for reds and 12.5% for whites). Alcohol distribution seems wider for white than for red wines, and the mean for whites is slightly higher than for reds, being 10.51% and 10.42% respectively. Reds have 3 outliers above 13.5% while whites have no outliers.

Residual Sugar

Summary of Sugar in Red Wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Summary of Sugar in White Wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

White wines have more sugar than red wines on average. The mean residual sugar concentration is 6.4 g/L for whites and 2.5 g/L for reds, about 2.5x lower. The maximum concentration for whites, 65.8 g/L, is also much higher than for reds, 15.5 g/L. One interesting fact is that the median and 3rd quartile of red wines are very near each other, being 2.2 g/L and 2.6 g/L respectively.

Residual sugar distribution differs greatly between red and white wines. While both appear positively skewed, the red distribution is concentrated around 2 g/L and the white one is much wider, with a greater diversity of sugar levels. The boxplots especially show how concentrated the distribution of red wines is compared to white wines, which have an interquartile range almost 12x wider.
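The "almost 12x" figure follows from the quartiles in the two residual sugar summaries above, as this short worked example shows.

```r
# Quartiles copied from the residual sugar summaries above (g/L).
red_q1   <- 1.9; red_q3   <- 2.6
white_q1 <- 1.7; white_q3 <- 9.9

red_iqr   <- red_q3 - red_q1      # interquartile range for reds
white_iqr <- white_q3 - white_q1  # interquartile range for whites

white_iqr / red_iqr               # about 11.7
```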

Chlorides

Summary of Chlorides in Red Wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Summary of Chlorides in White Wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Average chloride concentration is higher in red wines, being almost double the concentration in white wines - 0.087 g/L against 0.046 g/L.

The concentration for both wine types appears normally distributed. In the histogram and density plot we can see that the red distribution has a higher mean, as it’s shifted to the right of the white distribution. Looking at the boxplots, we can notice that the distribution of red wines is slightly wider than the distribution of white wines.

Density

Summary of Density in Red Wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

Summary of Density in White Wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Density varies very little among wines, with most of them having a density between 0.99 g/mL and 1.00 g/mL.

Density distribution seems normal and more concentrated for red wines, while positively skewed and wider for white wines. The distribution for reds has only one noticeable peak, while it seems trimodal for whites - with one peak for each peak in alcohol level, maybe.

Acidity

Fixed acidity, which is given by the concentration of tartaric acid, is approximately normally distributed in both wine types and peaks at about 7 g/L. Its distribution for red wines is a bit higher, wider and more skewed than for white wines.

Volatile acidity, which is given by acetic acid concentration, is generally lower for white wines (mean at about 0.25 g/L), where it is roughly normally distributed with a positive skew. The distribution for red wines seems wider and bimodal.

Citric acid concentration varies around 0.3 g/L and presents a trimodal distribution for both wine types. While almost all white wines contain some level of citric acid, a significant portion of red wines have a concentration of 0 g/L.

pH for both wine types is normally distributed around a mean of about 3.25. The distribution for red wines is slightly higher than for white wines.

Sulfur

Sulphates concentration is very similar for red and white wines, both normally distributed and with peaks between 0.4 and 0.6 g/L. The only noticeable difference is that the concentration is generally lower for white wines.

Total and free sulfur dioxide concentration distributions are more positively skewed for red wines, while white wines have a wider, normal-like distribution. Free sulfur dioxide presents lower values than total sulfur dioxide, which makes sense since the free form is only one part of the total. One important difference between types is that total and free sulfur dioxide concentration is generally higher for white wines. Most of them have a concentration that makes the gas evident in the nose and taste of wine (above the threshold of 50 ppm), while most red wines don’t.

One interesting fact with these variables is that they present many outliers, with values as high as about 3x the mean.

Univariate Analysis Highlights

From the analysis of individual variables distributions, the most relevant facts I found were:

  • the distribution of quality rating, which was highly concentrated on 5 and 6 and leaned towards higher ratings, as there were far more wines rated above 5 than below it. No red wines were rated more than 8 and the maximum rating for white wines was 9. This may also indicate a bias from the wine judges, since it seems that a wine needs to be truly extraordinary and surprising to reach the highest grades, which they may have associated with perfection;
  • the wide distribution of alcohol, which ranges from about 9% to 13%; I expected to see less variation with a high peak at about 12%;
  • the difference in sugar distribution between red and white wines. Red wines had very little variation in sugar level, while white wines had a much wider and more diverse distribution;
  • the level of salt in red wines is almost double the level in white wines;
  • density is generally below 1 g/mL and varies very little among red and white wines;
  • most wines have pH in the 3 to 4 range, with white wines being slightly more acidic than red ones;
  • white wines generally have more SO2 than red wines, maybe due to their higher sugar concentration.

Exploring Relationships

Now we’ll explore relationships between the features in the dataset. I’ll look at relationships between pairs of variables, starting with the relationship between quality and the other variables.

Paired plots of features in the Dataset

Based on the paired plots above, I decided to have a look at bivariate plots of the variables that seemed most correlated with quality, so we could get a better definition of their relationships.

Exploring Correlations with Quality Rating

As quality is an ordered factor, I decided to create boxplots of the variables as functions of quality rating, and to add a line connecting the median values for each quality rating - so we can see trends.
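The idea can be sketched in base R as follows. This is only an illustration of the technique, not necessarily the plotting code used for this report, and it uses just the first ten red-wine rows shown in the feature listing, so the resulting plot is not representative.

```r
# First ten red-wine rows from the listing in "List of features".
red <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5), ordered = TRUE)
)

# Median alcohol level within each quality rating.
medians <- tapply(red$alcohol, red$quality, median)

# Avoid writing a plot file when the script runs non-interactively.
if (!interactive()) pdf(NULL)

# One box per quality rating, plus a red line through the medians.
boxplot(alcohol ~ quality, data = red,
        xlab = "quality", ylab = "alcohol (% by volume)")
lines(seq_along(medians), medians, type = "b", col = "red")

if (!interactive()) invisible(dev.off())
```

Because boxes are drawn at positions 1, 2, 3, ... along the x axis, `seq_along(medians)` places each median point on top of its box.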

Alcohol Level

Here we can see the relationship between alcohol level and quality. Though the trend isn’t a straight line, we can notice that quality is generally positively correlated with alcohol level for both red and white wines. Alcohol seems to be the variable with the greatest correlation to quality.

Volatile Acidity

Here, we can notice that quality apparently has a clear negative correlation with volatile acidity in red wines. Quality rating doesn’t seem to be affected by volatile acidity in white wines, though.

Citric Acid

Citric acid seems to be positively correlated with quality but, again, only in red wines. Citric acid concentration is stable across quality groups in white wines.

Salt (Chlorides)

In the boxplots above, we can see a small but noticeable negative correlation between quality and salt in red wines, and a stronger negative correlation in white wines.

Density, pH and Sulphates

In the first boxplot row above, we can see the relationship between density and quality. We can spot a negative correlation between density and quality in both wine types, but a stronger one for white wines.

The second row shows us how quality varies with pH. Here, an interesting trend emerges: though not strongly, quality seems to correlate with pH in both wine types. But while for red wines the correlation seems negative, for white wines it seems positive.

And lastly, we have the relationship between sulphates and quality. Although potassium sulphate concentration doesn’t vary a lot, we can notice a small increase in its concentration for higher quality scores in red wines.

5 features with the highest correlation to quality in Red Wines

##                      volatile.acidity citric.acid total.sulfur.dioxide
## volatile.acidity            1.0000000 -0.55249568           0.07647000
## citric.acid                -0.5524957  1.00000000           0.03553302
## total.sulfur.dioxide        0.0764700  0.03553302           1.00000000
## sulphates                  -0.2609867  0.31277004           0.04294684
## alcohol                    -0.2022880  0.10990325          -0.20565394
## quality                    -0.3905578  0.22637251          -0.18510029
##                        sulphates     alcohol    quality
## volatile.acidity     -0.26098669 -0.20228803 -0.3905578
## citric.acid           0.31277004  0.10990325  0.2263725
## total.sulfur.dioxide  0.04294684 -0.20565394 -0.1851003
## sulphates             1.00000000  0.09359475  0.2513971
## alcohol               0.09359475  1.00000000  0.4761663
## quality               0.25139708  0.47616632  1.0000000

Here we have the correlation matrix for the 5 features with the highest correlation to quality in red wines. Important facts to notice are:

  • alcohol level is the factor with the highest correlation with quality;
  • Volatile acidity has the second highest correlation with quality;
  • Sulphates has the 3rd highest correlation with quality;
  • Citric acid has a moderate negative correlation with volatile acidity (-0.55) and weaker positive correlations with sulphates (0.31) and quality (0.23).
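A correlation matrix like the one above can be computed with `cor()` once quality is converted to a number. The sketch below assumes quality is stored as an ordered factor, as described in the Preparation section, and uses only the first ten red-wine rows from the feature listing, so its numbers will not match the full-sample matrix.

```r
# First ten red-wine rows from the listing in "List of features".
red <- data.frame(
  volatile.acidity     = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid          = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  sulphates            = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol              = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality              = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5), ordered = TRUE)
)

# cor() needs numbers; go through as.character() so the factor
# yields the ratings themselves rather than the level indices.
red$quality <- as.numeric(as.character(red$quality))

# Pearson correlation matrix of all columns.
cor_matrix <- cor(red)

# Rank features by absolute correlation with quality.
with_quality <- cor_matrix[rownames(cor_matrix) != "quality", "quality"]
sort(abs(with_quality), decreasing = TRUE)
```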

5 features with the highest correlation to quality in White Wines

##                      volatile.acidity   chlorides total.sulfur.dioxide
## volatile.acidity           1.00000000  0.07051157            0.0892605
## chlorides                  0.07051157  1.00000000            0.1989103
## total.sulfur.dioxide       0.08926050  0.19891030            1.0000000
## density                    0.02711385  0.25721132            0.5298813
## alcohol                    0.06771794 -0.36018871           -0.4488921
## quality                   -0.19472297 -0.20993441           -0.1747372
##                          density     alcohol    quality
## volatile.acidity      0.02711385  0.06771794 -0.1947230
## chlorides             0.25721132 -0.36018871 -0.2099344
## total.sulfur.dioxide  0.52988132 -0.44889210 -0.1747372
## density               1.00000000 -0.78013762 -0.3071233
## alcohol              -0.78013762  1.00000000  0.4355747
## quality              -0.30712331  0.43557472  1.0000000

This is the same correlation matrix as before, but now for white wines. Important facts to notice are:

  • Generally all variables have weaker correlation to quality in white wines than in red wines;
  • Again, alcohol level is the factor with the highest correlation with quality;
  • Now, density appears among the top 5 correlated variables, with the 2nd highest correlation (-0.31). But since the correlation between alcohol and density is very strong (-0.78), the correlation between density and quality may be largely indirect, due to alcohol;
  • Salt has the 3rd highest correlation to quality in white wines, being negatively correlated.
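As a rough check of the idea that density relates to quality mainly through alcohol, one can compute a first-order partial correlation from the three pairwise coefficients in the matrix above. This check is an addition to the original analysis and assumes linear relationships between the variables.

```r
# Pairwise correlations for white wines, copied from the matrix above.
r_density_quality <- -0.3071233
r_alcohol_quality <-  0.4355747
r_density_alcohol <- -0.7801376

# First-order partial correlation of density and quality,
# holding alcohol fixed.
partial <- (r_density_quality - r_density_alcohol * r_alcohol_quality) /
  sqrt((1 - r_density_alcohol^2) * (1 - r_alcohol_quality^2))

round(partial, 2)  # 0.06
```

A value this close to zero is consistent with the reading above: once alcohol is accounted for, little linear association between density and quality remains.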

Exploring indirect correlations

As we’ve seen in the paired plots and in these matrices, many variables in the dataset correlate with each other. Some correlations are obvious, such as fixed acidity and pH, but others are less obvious, such as the one between residual sugar and sulfur dioxide.

In this section we explore some of these interesting indirect correlations.

Relationship between pH and acidity

In the scatter plots above, we can see that pH has a strong negative correlation with fixed acidity, and with citric acid concentration as well, though not as strongly. That makes sense, since the more acid there is in the wine, the lower its pH tends to be. But one initially surprising and counterintuitive correlation is the positive correlation between volatile acidity and pH. One would naturally expect volatile acidity to decrease pH, since we tend to expect volatile acidity to increase with total acidity. But the relationship is the opposite, maybe because as volatile acidity increases more acid tends to be released from the liquid, which therefore becomes less acidic.

Relationships of Residual Sugar with other variables

As the scatter plots above show, residual sugar level is positively correlated with total and free sulfur dioxide. As sulfur dioxide prevents microbial growth and the oxidation of wine, it makes sense for it to be in greater concentration in the wines most vulnerable to microbial growth, which I assume to be the wines with more sugar.

Residual sugar also strongly correlates with density. As sugar level increases, so does density. It makes sense, as the more sugar is dissolved in the liquid, the heavier it tends to get without a matching increase in volume.

Relationships of Alcohol with other variables

As the first scatter plot in this sequence shows, for white wines, residual sugar is negatively correlated with alcohol level. It makes sense if we consider that alcohol production through fermentation consumes sugar, so we’d have less sugar left where more alcohol was produced. In red wines, we see a very small increase in sugar level with higher alcohol levels. That seems to contradict the explanation above. But as the residual sugar range in red wines is very narrow, I assume that red grapes generally have less sugar than white grapes, which forces producers to consume the maximum amount of sugar to produce wines and to choose sweeter grapes to make wines with more alcohol.

Alcohol also seems to be negatively related to salt level, though I can’t find a logical explanation for it.

Total sulfur dioxide appears to be negatively correlated with alcohol in both white and red wines. This relationship might be indirect, through sugar: as seen above, more alcohol tends to go with less residual sugar (at least in white wines), and less sugar tends to go with less sulfur dioxide.

Finally, we can see the strongest correlation found in the dataset: the negative one between alcohol and density. As alcohol concentration increases, density decreases. That’s because alcohol is less dense than water, so as its share of the volume increases, the mean density of the whole solution tends to decrease.

Relationships Analysis Highlights

From the analysis of bivariate relationships in this dataset, what I consider the most important findings are:

  • alcohol level is the feature with the highest correlation to quality for both red (r = 0.48) and white (r = 0.44) wines;
  • volatile acidity (-0.39), citric acid (0.23) and sulphates (0.25) also have a weak but noticeable correlation to quality in red wines;
  • volatile acidity (-0.19) and chlorides (-0.21) also have a weak correlation to quality in white wines;
  • features generally have a lower correlation to quality in white wines than in red wines;
  • pH surprisingly had opposite effects on quality in red wines and white wines - but weak ones;
  • an interesting and unexpected relationship between sugar and sulfur dioxide was found;
  • alcohol was negatively correlated with chlorides;
  • the strongest relationship found in this dataset was between alcohol and density in white wines, with a correlation coefficient of -0.78.

Exploring Multiple Variables

We’ve already uncovered the factors that were most correlated with quality in our dataset. But now, in order to see if there is some interesting relationship that could be uncovered by observing how quality varies along two other variables, we’ll plot the main features related to quality together and use color to distinguish their quality rating. I added ellipses to the scatterplots, one per quality rating, marking the 95% confidence region in which its datapoints are likely to fall.

Quality by Alcohol Level and Density

Here we can see how both red and white wine quality varies with density and alcohol. As density decreases and alcohol level increases, quality tends to increase as well, as shown by the higher concentration of dark datapoints and the ellipses getting darker towards the bottom right of the plots.

We can also notice that red wines seem to be much less affected by density than white wines.

Quality by Alcohol Level and Volatile Acidity

We can notice here that for red wines, quality tends to increase towards the bottom right side of the chart. That indicates that quality tends to increase with alcohol level and decrease with volatile acidity in red wines. On the other hand, white wines seem to be almost unaffected by changes in the level of volatile acidity, varying just slightly negatively in relation to it.

Quality by Citric Acid and Chloride Concentration

Here we can observe that each wine type varies more noticeably with only one of the two variables. Red wine quality increases slightly as citric acid concentration increases, as can be noticed by the increased concentration of darker datapoints upwards. And it looks like white wine quality increases as chloride concentration decreases, while remaining relatively unaffected by citric acid changes.

Quality by Sulphates and Total Sulfur Dioxide

Finally, in this plot we see that red wine quality increases with sulphates concentration and tends to decrease slightly with total sulfur dioxide concentration. On the other hand, there’s no noticeable trend between white wine quality and either sulphates or total sulfur dioxide concentrations.

Linear Models for Quality Prediction

To sum up our analysis, I decided to build simple linear models to predict quality rating of red and white wines from the features with the highest correlation to it.

Red Wines

For red wines I used alcohol level and volatile acidity as independent variables because they’re the 2 features with the highest correlation to quality, and the correlation between them is relatively low. Moreover, R2 didn’t increase by adding other features to the model.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = red)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = red)
## 
## ==========================================
##                        m1         m2      
## ------------------------------------------
##   (Intercept)        1.875***   3.095***  
##                     (0.175)    (0.184)    
##   alcohol            0.361***   0.314***  
##                     (0.017)    (0.016)    
##   volatile.acidity             -1.384***  
##                                (0.095)    
## ------------------------------------------
##   R-squared              0.2        0.3   
##   adj. R-squared         0.2        0.3   
##   sigma                  0.7        0.7   
##   F                    468.3      370.4   
##   p                      0.0        0.0   
##   Log-likelihood     -1721.1    -1621.8   
##   Deviance             805.9      711.8   
##   AIC                 3448.1     3251.6   
##   BIC                 3464.2     3273.1   
##   N                   1599       1599     
## ==========================================

The variables in this linear model can account for 30% of the variance in the quality rating of red wines.
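The two model formulas shown in the output above can be fitted with `lm()` as sketched below. Only the first ten red-wine rows from the feature listing are used here, with quality as a plain number, so the coefficients and R-squared values will differ from the full-sample table.

```r
# First ten red-wine rows from the listing in "List of features".
red <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Same formulas as models m1 and m2 in the output above.
m1 <- lm(quality ~ alcohol, data = red)
m2 <- lm(quality ~ alcohol + volatile.acidity, data = red)

# Share of variance explained by each model.
r2_m1 <- summary(m1)$r.squared
r2_m2 <- summary(m2)$r.squared

# Adjusted R-squared penalises the extra predictor in m2.
adj_r2_m2 <- summary(m2)$adj.r.squared

c(m1 = r2_m1, m2 = r2_m2, m2_adjusted = adj_r2_m2)
```

When comparing models with different numbers of predictors, adjusted R-squared is usually the more informative of the two figures, since plain R-squared cannot go down when a predictor is added to a least-squares fit.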

Predictions

Considering that quality is an integer, I decided to run the model rounding its predicted quality rating, so I could then compare it to the actual ratings. With this change, the model correctly predicted the quality of 906 instances from 1599 wines in the sample, yielding an accuracy rate of 57%.

##    Mode   FALSE    TRUE    NA's 
## logical     693     906       0
## [1] 0.5666041
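The rounding-and-compare step can be sketched with the coefficients of model m2 as printed in the table above. The example applies them to the first ten red-wine rows from the feature listing only, so the resulting hit rate is not the full-sample accuracy, and the printed coefficients are rounded.

```r
# First ten red-wine rows from the listing in "List of features".
red <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Coefficients of model m2, rounded as printed in the table above.
predicted <- 3.095 + 0.314 * red$alcohol - 1.384 * red$volatile.acidity

# Round to the nearest integer rating and compare with the actual one.
correct <- round(predicted) == red$quality

summary(correct)
mean(correct)  # 0.7 for these ten rows
```

With a fitted model object, `predict()` would replace the hand-written formula; the comparison step stays the same.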

White Wines

For white wines, I only used alcohol level to predict quality. When other features were added to the model, they made it worse, decreasing its R2.

## 
## Calls:
## m1: lm(formula = round(quality, digits = 0) ~ alcohol, data = white)
## 
## =============================
##   (Intercept)      2.582***  
##                   (0.098)    
##   alcohol          0.313***  
##                   (0.009)    
## -----------------------------
##   R-squared            0.2   
##   adj. R-squared       0.2   
##   sigma                0.8   
##   F                 1146.4   
##   p                    0.0   
##   Log-likelihood   -5839.4   
##   Deviance          3112.3   
##   AIC              11684.8   
##   BIC              11704.3   
##   N                 4898     
## =============================

And as we can see, alcohol in this linear model accounts for 20% of the variance in the quality rating of white wines.

Predictions

To test the predictive power of the model for white wines, I tweaked it the same way, rounding its output. In this way, the model correctly predicted the quality of 2398 instances from 4898 white wines in the sample, yielding an accuracy rate of 49%.

##    Mode   FALSE    TRUE    NA's 
## logical    2500    2398       0
## [1] 0.4895876

Multivariate Analysis Highlights

  • while looking at the multivariate plots and plotting alcohol’s relationship to quality along with other variables, it became evident that alcohol is the feature with the greatest impact on quality;
  • it became clear that white wine quality is more strongly impacted by density than red wine quality, even though both wine types are strongly impacted by alcohol, which itself impacts density;
  • by plotting two variables at the same time, it’s very easy to compare the different degrees of impact that each variable has on quality. As an example, by plotting volatile acidity with alcohol level and quality, it was possible to notice volatile acidity’s impact on red wine quality, and also how weak this impact is when compared to alcohol’s;
  • even though calculations showed that some variables weakly correlate with quality, it was very hard to spot any trends for them in multivariate scatterplots with quality as color, as is the case for chlorides and total sulfur dioxide.

Final Plots and Summary

Plot One

Description One

The main goal of this exploratory analysis was to understand how quality of red and white wines is impacted by physicochemical properties such as residual sugar, chlorides, pH, alcohol level, and more. A surprising finding is that most properties in the dataset don’t correlate strongly enough to quality, except for alcohol level. Alcohol presented a moderate correlation to quality for the alcohol range in the sample, from about 9% to 14% of volume. As can be seen in this plot, generally speaking, as alcohol volume increase from about 9% to 14%, median quality ratings (red lines) also increase for both red and white wines - this trend is less clear to quality rating from 3 to 5. And as can be seen above as well through the boxplots for each quality score, the alcohol values in interquantile range also tend to increase with quality score, mainly for wines with the best scores (7-9).

The main question derived from this fact is: why? From what I’ve read about alcohol and quality, experts don’t seem to have an intentional preference for alcohol levels from 11% to 14%. As their judgement tends to be based on other properties such as wine texture, color, region, and so on, I assume that white and red wines that possess valued characteristics in those other properties may also tend to have an alcohol level in the 11% to 14% range. So maybe alcohol has a strong correlation to other properties external to this dataset that are more commonly associated with quality, and that’s why alcohol level tends to increase with quality rating (for the alcohol range in the sample).

Plot Two

Description Two

An important and interesting finding from the exploration is that the strongest relationships in the dataset were found among the features themselves rather than with quality. A striking example is the correlation between alcohol level and density, which has a coefficient of -0.78 for white wines. As can be seen in this scatterplot, wine density tends to decrease as alcohol level increases. This happens because alcohol’s density is lower than water’s density and therefore, as alcohol level increases, overall wine density tends to decrease. These “indirect” relationships were also found among other features in the dataset, such as citric acid and volatile acidity, or fixed acidity and pH.
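The density argument above can be illustrated with a rough calculation. The sketch below, in base R, treats wine as an ideal mixture of water and ethanol. The densities are approximate textbook values at 20 °C, and `mix_density` is a hypothetical helper that is not part of the original analysis. It ignores volume contraction on mixing and the sugars and acids dissolved in real wine, so the absolute values will not match the dataset; only the downward trend is the point.

```r
# Approximate densities at 20 °C in g/cm^3 (textbook values, assumed here).
rho_water   <- 0.998
rho_ethanol <- 0.789

# Idealized volume-weighted mixture of water and ethanol.
# Ignores volume contraction and dissolved solids, so it only shows the trend.
mix_density <- function(abv_percent) {
  frac <- abv_percent / 100
  frac * rho_ethanol + (1 - frac) * rho_water
}

abv     <- 9:14               # alcohol range seen in the sample (% by volume)
density <- mix_density(abv)

print(round(density, 4))      # density falls steadily as alcohol rises
print(cor(abv, density))      # strongly negative for this idealized mixture
```

In this idealized mixture the relationship is perfectly linear. In the real data other dissolved components such as residual sugar also affect density, which may be why the observed coefficient is weaker than in the sketch.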

I see these correlations among features as an indication of two possibilities:

  • the existence of confounding factors. For example, some of the physicochemical properties measured to form this dataset might be the result or input of the same chemical reaction. As they’re connected by the same process, their variation is likely connected as well;

  • poor feature selection. Is it really useful to have variables that are clearly related, such as fixed acidity and pH, in the same dataset, or are they redundant? If we’re to use them to predict quality score or understand what may cause a better quality score, having many variables that correlate with each other isn’t helpful.
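One simple way to screen for potentially redundant features is to compute the pairwise correlation matrix and list the pairs whose absolute correlation exceeds some cutoff. The sketch below uses base R and a small made-up data frame (`toy`); the values are illustrative only and are not taken from the wine dataset, and the 0.7 cutoff is an arbitrary assumption.

```r
# Hypothetical toy data: values are invented for illustration only.
toy <- data.frame(
  alcohol   = c(9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5),
  density   = c(0.9990, 0.9984, 0.9975, 0.9971, 0.9960, 0.9956, 0.9945, 0.9941),
  chlorides = c(0.045, 0.060, 0.038, 0.052, 0.041, 0.058, 0.047, 0.050)
)

cors <- cor(toy)                              # Pearson correlation matrix
cors[upper.tri(cors, diag = TRUE)] <- NA      # keep each pair only once

cutoff <- 0.7                                 # arbitrary threshold (assumption)
idx <- which(abs(cors) > cutoff, arr.ind = TRUE)

# One row per strongly correlated pair of features.
flagged <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = cors[idx]
)
print(flagged)
```

With this toy data only the alcohol and density pair is flagged. A flagged pair is a candidate for closer inspection, not proof that one of the variables should be dropped.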

Plot Three

Description Three

Another important trend found in the dataset is that the impact each feature has on quality differs significantly between red and white wines.

As can be seen in this plot, the darker ellipses (which represent the 95% confidence interval of where wines with higher quality rating might appear) tend to move horizontally to the right for both wine types. This indicates that the quality of both wine types increases with alcohol level.

On the other hand, the darker ellipses tend to move vertically only for red wines, which indicates that only red wine is impacted by acetic acid concentration. As acetic acid concentration increases, the quality distribution remains virtually uniform for white wines, whereas quality for red wines increases as acetic acid concentration decreases.

This is a common phenomenon in this dataset. Other properties such as citric acid or chloride concentration also impact red and white wines in different ways.


Reflection

At the beginning of the analysis it was interesting to discover how the distribution of physicochemical properties differed between red and white wines, especially the differences in residual sugar, sodium chloride, density and total sulfur dioxide. One should expect such differences in the distributions of red and white wine features, but it’s interesting to see them so tangibly. One finding that was particularly intriguing is the tendency of white wines to be better rated: 67% of white wines against 53% of red wines had ratings greater than 5. Does that mean experts prefer white wines? I don’t think we can quite say that, but this trend raises the question, or leaves us wondering whether the white wine judges were the same as the red wine judges and whether they were biased towards white wines.

While it was possible to create a linear model with an accuracy rate of almost 60% for red wines and, more disappointingly, almost 50% for white wines, I felt frustrated to realize that the models failed to identify most of the extreme ratings (i.e. 3, 4, 8 and 9). I ended up feeling that these models might be just as good as guessing. Moreover, another frustrating fact is how few features correlated with quality. Only alcohol had a moderate correlation to quality in both wines, and the rest of the variables had weak or very weak correlations to quality. This was especially true for white wines, whose correlations were generally weaker than the ones found in red wines.
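Whether a model is “just as good as guessing” can be checked by comparing its accuracy with a naive baseline that always predicts the most frequent rating. The sketch below is in base R and uses made-up vectors (`actual`, `predicted`) that do not come from the models in this analysis. It also assumes accuracy means the share of predictions that match the rating after rounding to a whole number, which may differ from how the figures above were computed.

```r
# Hypothetical ratings and rounded model predictions (illustrative only).
actual    <- c(5, 6, 5, 7, 6, 5, 4, 6, 8, 5)
predicted <- c(5, 6, 6, 6, 6, 5, 5, 5, 6, 5)

# Share of predictions that match the actual rating exactly.
model_accuracy <- mean(predicted == actual)

# Baseline: always predict the most frequent rating in the data.
majority          <- as.numeric(names(which.max(table(actual))))
baseline_accuracy <- mean(actual == majority)

print(c(model = model_accuracy, baseline = baseline_accuracy))
```

If the model’s accuracy is only slightly above the baseline, most of its apparent skill comes from the dominance of the middle ratings rather than from the features.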

One fact that made me question the presence of some features in the dataset was the strong correlations some features had among themselves, such as alcohol and density or pH and fixed acidity. This indicates that some third, missing variable might be causing the variation of these features and that some dependent variables (e.g. density) might be redundant.

Overall, I felt the feature set we had wasn’t that relevant for predicting quality rating. Experts’ judgement might indeed depend on taste, but it seems that the physicochemical variables in this dataset, except for alcohol, weren’t strong enough to affect the experts’ perception of taste. In addition, as we can assume from experience that the experts’ judgement is subjective, it may depend on other variables such as type of grape, production year, producer, country of origin, region, aging and price. I suspect that these variables are more strongly correlated to ratings than a wine’s physicochemical properties. Nevertheless, it’s interesting to realize that you can still judge wine quality (to some extent) using physicochemical data without the need to actually taste it.

An interesting follow-up project to this one would be to include the other variables mentioned above so we could create a predictive model of quality in red and white wines. This model could provide insights and guidance to data-driven wine producers, so that they could manage their production to create wines with the right features to yield a high quality perception. That model would be a valuable asset and source of competitive advantage to wine producers who apply it.
